The goal of this notebook is to introduce you to the TCGA Annotations BigQuery table. You can find more detail about Annotations on the TCGA Wiki, but the key things to know are:
The current set of annotation classifications includes: Redaction, Notification, CenterNotification, and Observation. The authority for Redactions and Notifications is the BCR (Biospecimen Core Resource), while CenterNotifications can come from any of the data-generating centers (GSC or GCC), and Observations from any authorized TCGA personnel. Within each classification type, there are several categories.
We will look at these further by querying directly on the Annotations table.
Note that annotations about patients, samples, and aliquots are separate from the clinical, biospecimen, and molecular data, and most patients, samples, and aliquots do not in fact have any annotations associated with them. It can be important, however, when creating a cohort or analyzing the molecular data associated with a cohort, to check for the existence of annotations.
As usual, in order to work with BigQuery, you need to import the Python bigquery module (gcp.bigquery), and you need to know the name(s) of the table(s) you are going to be working with:
In [1]:
import gcp.bigquery as bq
annotations_BQtable = bq.Table('isb-cgc:tcga_201607_beta.Annotations')
In [2]:
%bigquery schema --table $annotations_BQtable
Out[2]:
In [3]:
%%sql
SELECT itemTypeName, COUNT(*) AS n
FROM $annotations_BQtable
GROUP BY itemTypeName
ORDER BY n DESC
Out[3]:
The length of the barcode in the itemBarcode field will depend on the value in the itemTypeName field: if the itemType is "Patient", then the barcode will be something like TCGA-E2-A15J, whereas if the itemType is "Aliquot", the barcode will be a full-length barcode, e.g. TCGA-E2-A15J-10A-01D-a12N-01.
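The relationship between barcode length and item type can be sketched in plain Python. This is only an illustration and not part of the original notebook: the helper names barcode_fields and patient_barcode are hypothetical, and the sketch relies solely on the two example barcodes above (three hyphen-separated fields for a patient, seven for a full aliquot barcode).

```python
def barcode_fields(barcode):
    """Return the number of hyphen-separated fields in a TCGA-style barcode."""
    return len(barcode.split("-"))


def patient_barcode(barcode):
    """Truncate any TCGA-style barcode to its leading three fields.

    Assumption: the first three fields always identify the patient,
    as in the two example barcodes quoted in the text.
    """
    fields = barcode.split("-")
    if len(fields) < 3:
        raise ValueError("not a TCGA-style barcode: %r" % barcode)
    return "-".join(fields[:3])


aliquot = "TCGA-E2-A15J-10A-01D-a12N-01"
print(barcode_fields("TCGA-E2-A15J"))  # 3 fields for the patient-level example
print(barcode_fields(aliquot))         # 7 fields for the aliquot-level example
print(patient_barcode(aliquot))        # TCGA-E2-A15J
```

A helper like this may be handy when an aliquot-level row needs to be compared against a list of patient-level barcodes.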
The next most important pieces of information about an annotation are the "classification" and "category". Each of these comes from a controlled vocabulary and each "classification" has a specific set of allowed "categories".
One important thing to understand is that if an aliquot carries some sort of disqualifying annotation, in general all other data from other samples or aliquots associated with that same patient should still be usable. On the other hand, if a patient carries some sort of disqualifying annotation, then that information should be considered prior to using any of the samples or aliquots derived from that patient.
To illustrate this, let's look at the most frequent annotation classifications and categories when the itemType is Patient:
In [4]:
%%sql
SELECT
annotationClassification,
annotationCategoryName,
COUNT(*) AS n
FROM
$annotations_BQtable
WHERE
( itemTypeName="Patient" )
GROUP BY
annotationClassification,
annotationCategoryName
HAVING ( n >= 50 )
ORDER BY
n DESC
Out[4]:
The results of the previous query indicate that the majority of patient-level annotations are "Notifications", most frequently regarding prior malignancies. In most TCGA publications, "history of unacceptable prior treatment" and "item is noncanonical" notifications are treated as disqualifying annotations, and none of the data associated with those patients is used in any analysis.
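As a rough sketch of how such an exclusion list might be assembled, the following plain-Python snippet scans annotation rows and collects the patients carrying one of the two disqualifying notifications. It assumes the rows have already been fetched as dictionaries keyed by the table's column names; the function name disqualified_patients, the example rows, and the TCGA-XX-... barcodes are all made up for illustration. Matching is done case-insensitively on substrings because the exact category strings stored in the table may differ from the phrases quoted above.

```python
# Phrases quoted in the text; matched as lowercase substrings because the
# exact category strings in the table are not assumed here.
DISQUALIFYING_PHRASES = (
    "history of unacceptable prior treatment",
    "item is noncanonical",
)


def disqualified_patients(rows, phrases=DISQUALIFYING_PHRASES):
    """Return the set of patient barcodes with a disqualifying patient-level annotation."""
    excluded = set()
    for row in rows:
        if row["itemTypeName"] != "Patient":
            continue  # only patient-level annotations are considered here
        category = row["annotationCategoryName"].lower()
        if any(phrase in category for phrase in phrases):
            excluded.add(row["itemBarcode"])
    return excluded


# Made-up rows for illustration only (not real query output).
example_rows = [
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-XX-0001",
     "annotationCategoryName": "Item is noncanonical"},
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-XX-0002",
     "annotationCategoryName": "Prior malignancy"},
    {"itemTypeName": "Aliquot", "itemBarcode": "TCGA-XX-0003-01A-01D-0000-01",
     "annotationCategoryName": "Item is noncanonical"},
]
print(sorted(disqualified_patients(example_rows)))  # ['TCGA-XX-0001']
```

Whether a given category should really be treated as disqualifying is an analysis decision; the tuple of phrases is meant to be adjusted.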
Let's make a slight modification to the last query to see what types of annotation categories and classifications we see when the item type is not patient:
In [5]:
%%sql
SELECT
annotationClassification,
annotationCategoryName,
itemTypeName,
COUNT(*) AS n
FROM
$annotations_BQtable
WHERE
( itemTypeName!="Patient" )
GROUP BY
annotationClassification,
annotationCategoryName,
itemTypeName
HAVING ( n >= 50 )
ORDER BY
n DESC
Out[5]:
The results of the previous query indicate that the vast majority of annotations are at the aliquot level. More specifically, most were submitted by one of the data-generating centers to indicate that the data derived from that aliquot is "DNU" (Do Not Use). In general, this should not affect any other aliquots derived from the same sample or any other samples derived from the same patient.
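The scoping rule described here can be sketched as a small Python check. The prefix test below is an assumption about how barcodes nest (an annotated item is taken to affect itself and any item whose barcode extends it), not something defined by the Annotations table itself; the function name annotation_affects and the TCGA-XX-... barcodes are hypothetical.

```python
def annotation_affects(item_barcode, data_barcode):
    """Return True if an annotation on item_barcode plausibly applies to data_barcode.

    Assumption: an annotated item affects itself and anything derived from it,
    i.e. any barcode that extends the annotated barcode with further fields.
    """
    return (data_barcode == item_barcode
            or data_barcode.startswith(item_barcode + "-"))


# Hypothetical barcodes: two aliquots derived from the same patient.
flagged_aliquot = "TCGA-XX-0001-01A-01D-0000-01"
other_aliquot = "TCGA-XX-0001-10A-01D-0000-01"

# An aliquot-level annotation touches only that aliquot ...
print(annotation_affects(flagged_aliquot, flagged_aliquot))  # True
print(annotation_affects(flagged_aliquot, other_aliquot))    # False
# ... whereas a patient-level annotation reaches every aliquot from that patient.
print(annotation_affects("TCGA-XX-0001", other_aliquot))     # True
```

Appending the hyphen before the prefix test keeps a barcode such as TCGA-XX-0001 from accidentally matching TCGA-XX-00011.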
We see in the output of the previous query that a Notification that an "Item is noncanonical" can be applied to different types of items (e.g. slides and analytes). Let's investigate this a little further by counting these annotations by study (i.e. tumor type):
In [6]:
%%sql
SELECT
Study,
COUNT(*) AS n
FROM
$annotations_BQtable
WHERE
( annotationCategoryName="Item is noncanonical" )
GROUP BY
Study
ORDER BY
n DESC
Out[6]:
Now let's pick one of these tumor types and delve a little further:
In [7]:
%%sql
SELECT
itemTypeName,
COUNT(*) AS n
FROM
$annotations_BQtable
WHERE
( annotationCategoryName="Item is noncanonical"
AND Study="OV" )
GROUP BY
itemTypeName
ORDER BY
n DESC
Out[7]:
As described above, an annotation is specific to a single TCGA "item", and the fields itemTypeName and itemBarcode are the most important keys to understanding which TCGA item carries the annotation. Because we use the fields ParticipantBarcode, SampleBarcode, and AliquotBarcode throughout our other TCGA BigQuery tables, we have added them to this table as well, but they should be interpreted with some care: when an annotation is specific to an aliquot (i.e. itemTypeName="Aliquot"), the ParticipantBarcode, SampleBarcode, and AliquotBarcode fields will all be set, but this should not be interpreted to mean that the annotation applies to all data derived from that patient.
This will be illustrated with the following two queries which extract information pertaining to a few specific patients:
In [8]:
%%sql
SELECT
Study,
itemTypeName,
itemBarcode,
annotationCategoryName,
annotationClassification,
ParticipantBarcode,
SampleBarcode,
AliquotBarcode,
LENGTH(itemBarcode) AS n
FROM
$annotations_BQtable
WHERE
( ParticipantBarcode="TCGA-61-1916" )
ORDER BY n ASC
Out[8]:
In [9]:
%%sql
SELECT
Study,
itemTypeName,
itemBarcode,
annotationCategoryName,
annotationClassification,
ParticipantBarcode,
SampleBarcode,
AliquotBarcode,
LENGTH(itemBarcode) AS n
FROM
$annotations_BQtable
WHERE
( ParticipantBarcode="TCGA-GN-A261" )
ORDER BY n ASC
Out[9]:
As you can see in the results returned from the previous two queries, the SampleBarcode and the AliquotBarcode fields may or may not be filled in, depending on the itemTypeName.
In [10]:
%%sql
SELECT
Study,
itemTypeName,
itemBarcode,
annotationCategoryName,
annotationClassification,
annotationNoteText,
ParticipantBarcode,
SampleBarcode,
AliquotBarcode,
LENGTH(itemBarcode) AS n
FROM
$annotations_BQtable
WHERE
( ParticipantBarcode="TCGA-RS-A6TP" )
ORDER BY n ASC
Out[10]:
In this example, there is just one annotation relevant to this particular patient, and one has to look at the annotationNoteText to find out what the potential issue may be with this particular analyte. Any aliquots derived from this blood-normal analyte might need to be used with care.